NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).


Unveiling the Dynamics of Human Mobility in Response to Wildfire-Induced Air Quality Degradation: An Examination of the 2019 Kincade Fire

https://doi.org/10.1061/JMENEA.MEENG-6340

Shen, Xianglu; Zhang, Huixin; Wang, Yanzhi; Wang, Qi R (May 2025, Journal of Management in Engineering)

Full Text Available
Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices

https://doi.org/10.1145/3747842

Niu, Wei; Sun, Mengshu; Li, Zhengang; Chen, Jou-An; Guan, Jiexiong; Shen, Xipeng; Liu, Jun; Zhang, Mei; Wang, Yanzhi; Lin, Xue; et al (July 2025, ACM Transactions on Architecture and Code Optimization)

It is challenging to deploy 3D Convolutional Neural Networks (3D CNNs) on mobile devices, especially when both real-time execution and high inference accuracy are required, because the increasingly large model size and complex model structure of 3D CNNs usually demand tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning methods are either incompatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation. This paper proposes Mobile-3DCNN, an end-to-end 3D CNN acceleration framework based on pruning/compilation co-design. It consists of two parts: a novel fine-grained structured pruning scheme enhanced by prune/Winograd adaptive selection, which is mobile-hardware-friendly and can achieve high pruning accuracy, and a set of compiler optimization and code generation techniques enabled by this pruning, which fully transform the pruning benefit into real performance gains. The evaluation demonstrates that Mobile-3DCNN outperforms state-of-the-art end-to-end DNN acceleration frameworks that support 3D CNN execution on mobile devices, namely Alibaba Mobile Neural Networks and PyTorch-Mobile, with speedups of up to 34× and minor accuracy degradation, demonstrating that it is possible to execute high-accuracy large 3D CNNs on mobile devices in real time (or even ultra-real time).
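The abstract does not spell out the pruning pattern, so the following is only a generic illustration of why *structured* fine-grained pruning suits parallel hardware; it is not the Mobile-3DCNN algorithm. It assumes the 3D convolution weights have already been flattened into the usual GEMM view (one row per output filter, one column per input-channel × kd × kh × kw position). The function name `group_structured_prune`, the filter-group size, and the L2-norm scoring rule are all hypothetical choices made for this sketch. Within each group of filters the same columns are zeroed, so the surviving weights remain a regular dense block.

```python
import math
from typing import List

Matrix = List[List[float]]


def group_structured_prune(weights: Matrix, group_size: int, prune_ratio: float) -> Matrix:
    """Hypothetical group-wise structured pruning (illustration only).

    `weights` is the GEMM view of a 3D conv layer: one row per output filter,
    one column per flattened (input channel, kd, kh, kw) position.
    The same columns are zeroed for every filter inside a group.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be in [0, 1]")
    num_rows = len(weights)
    num_cols = len(weights[0]) if weights else 0
    pruned = [row[:] for row in weights]  # copy so the input is left untouched
    n_prune = int(prune_ratio * num_cols)  # columns removed per group (floored)
    for start in range(0, num_rows, group_size):
        group = range(start, min(start + group_size, num_rows))
        # Score each column by its L2 norm across the filters of this group.
        scores = [math.sqrt(sum(weights[r][c] ** 2 for r in group)) for c in range(num_cols)]
        # Lowest-scoring columns are pruned for the whole group.
        victims = sorted(range(num_cols), key=lambda c: scores[c])[:n_prune]
        for r in group:
            for c in victims:
                pruned[r][c] = 0.0
    return pruned


if __name__ == "__main__":
    w = [
        [0.9, 0.1, -0.5, 0.02],
        [0.8, -0.2, 0.4, 0.01],
        [0.05, 0.7, 0.0, 0.6],
        [0.1, -0.6, 0.3, 0.5],
    ]
    for row in group_structured_prune(w, group_size=2, prune_ratio=0.5):
        print(row)
```

Because every filter in a group keeps the same columns, a compiler can gather the surviving input positions once per group and run a smaller dense kernel, rather than handling scattered individual zeros.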
Full Text Available
Reducing Unfairness in Distributed Community Detection

https://doi.org/10.1109/ICDM59182.2024.00121

Zhang, Hao; Jayaweera, Malith; Ren, Bin; Wang, Yanzhi; Soundarajan, Sucheta (December 2024, IEEE)

Full Text Available
DEFCON: Deformable Convolutions Leveraging Interval Search and GPU Texture Hardware

https://doi.org/10.1109/IPDPS57955.2024.00063

Jayaweera, Malith; Li, Yanyu; Wang, Yanzhi; Ren, Bin; Kaeli, David (May 2024, IEEE)

Full Text Available
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

https://doi.org/10.1109/TCAD.2024.3443692

Dong, Peiyan; Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Li, Yanyu; Xu, Dongkuan; Huang, Heng; Hu, Jingtong; Jones, Alex K; Shi, Yiyu; et al (November 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

https://doi.org/10.1109/CVPR52733.2024.00827

Li, Zhengang; Kang, Yan; Liu, Yuchen; Liu, Difan; Hinz, Tobias; Liu, Feng; Wang, Yanzhi (June 2024, IEEE)

Full Text Available
Energy-Aware Tile Size Selection for Affine Programs on GPUs

https://doi.org/10.1109/CGO57630.2024.10444795

Jayaweera, Malith; Kong, Martin; Wang, Yanzhi; Kaeli, David (March 2024, IEEE)

Waxing-and-Waning: a Generic Similarity-based Framework for Efficient Self-Supervised Learning

Li, Sheng; Wu, Chang; Li, Ao; Wang, Yanzhi; Tang, Xulong; Yuan, Geng (May 2024, 2024 International Conference on Learning Representations)

Full Text Available
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

Dong, Peiyan; Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Li, Yanyu; Xu, Dongkuan; Huang, Heng; Hu, Jingtong; Jones, Alex K; Shi, Yiyu; et al (October 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

While Vision Transformers (ViTs) have shown consistent progress in computer vision, deploying them in real-time decision-making scenarios (< 1 ms) is challenging. Current computing platforms such as CPUs, GPUs, or FPGA-based solutions struggle to meet this deterministic low-latency requirement, even with quantized ViT models. Some approaches use pruning or sparsity to reduce model size and latency, but this often results in accuracy loss. To address these constraints, we propose EQ-ViT, an end-to-end acceleration framework with novel algorithm and architecture co-design features that enables real-time ViT acceleration on the AMD Versal Adaptive Compute Acceleration Platform (ACAP). The contributions are four-fold. First, we perform in-depth kernel-level performance profiling and analysis and explain the bottlenecks of existing acceleration solutions on GPU, FPGA, and ACAP. Second, at the hardware level, we introduce a new spatial and heterogeneous accelerator architecture, the EQ-ViT architecture, which leverages the heterogeneous features of ACAP, where both FPGA and artificial intelligence engines (AIEs) coexist on the same system-on-chip (SoC). Third, at the algorithm level, we create a comprehensive quantization-aware training strategy, the EQ-ViT algorithm, which concurrently quantizes both weights and activations into 8-bit integers, aiming to improve accuracy rather than compromise it during quantization. Notably, the method also quantizes nonlinear functions for efficient hardware implementation. Fourth, we design the EQ-ViT automation framework to implement the EQ-ViT architecture for four different ViT applications on the AMD Versal ACAP VCK190 board, achieving an accuracy improvement of 2.4% and average speedups of 315.0x, 3.39x, 3.38x, 14.92x, 59.5x, and 13.1x over the Intel Xeon 8375C vCPU, the Nvidia A10G, A100, and Jetson AGX Orin GPUs, and the AMD ZCU102 and U250 FPGAs, respectively. The energy efficiency gains are 62.2x, 15.33x, 12.82x, 13.31x, 13.5x, and 21.9x.
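The abstract states that EQ-ViT quantizes weights, activations, and nonlinear functions to 8-bit integers but does not give the scheme. As a hedged illustration only, the sketch below uses symmetric per-tensor int8 quantization (the quantize-then-dequantize "fake quantization" step commonly inserted during quantization-aware training) and a 256-entry lookup table for GELU, one common way to evaluate a nonlinearity directly on int8 codes. The function names, the per-tensor scaling, the clamp to ±127, and the choice of GELU as the example nonlinearity are assumptions, not details taken from the paper.

```python
import math
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization (assumed scheme): floats -> codes in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero for all-zero tensors
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def fake_quantize(values: List[float]) -> List[float]:
    """Quantize then dequantize, so a training loop would see the int8 rounding error."""
    codes, scale = quantize_int8(values)
    return dequantize(codes, scale)


def gelu(x: float) -> float:
    # Exact GELU, x * Phi(x), with the Gaussian CDF written via math.erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_gelu_lut(in_scale: float, out_scale: float) -> List[int]:
    """Hypothetical LUT: GELU evaluated for every int8 input code -128..127, stored as int8 codes."""
    lut = []
    for code in range(-128, 128):
        y = gelu(code * in_scale)
        lut.append(max(-127, min(127, round(y / out_scale))))
    return lut


def gelu_int8(codes: List[int], lut: List[int]) -> List[int]:
    # At inference time a table lookup replaces the floating-point nonlinearity.
    return [lut[c + 128] for c in codes]


if __name__ == "__main__":
    acts = [0.5, -1.27, 1.27, 0.0]
    codes, scale = quantize_int8(acts)
    print(codes, scale)
    print(fake_quantize(acts))
    lut = build_gelu_lut(in_scale=4.0 / 127, out_scale=4.0 / 127)
    print(gelu_int8([-128, 0, 127], lut))
```

A lookup table costs only 256 entries per nonlinearity and per pair of input/output scales, which is why it is a common hardware-friendly substitute for evaluating functions such as GELU or exponentials in floating point.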
Full Text Available

